33. Case Studies of Multicore Architectures I

多核架构案例研究 I
AP SHANTHI博士

本模块的目标是讨论对多核处理器的需求，并作为案例研究讨论英特尔多核架构的显着特性。

在之前的许多模块中，我们研究了利用 ILP 的不同方法。ILP 是机器指令级别的并行性，通过它处理器可以重新排序、流水线指令、使用预测、进行积极的分支预测等。 ILP 使处理器速度快速提高了二十年左右。在此期间，晶体管密度每 18 个月翻一番（摩尔定律），频率增加使得管道越来越深。图 37.1 显示了发生的趋势。但由于温度升高，深度流水线电路必须处理产生的热量。设计如此复杂的处理器很困难，而且随着频率的增加，回报也越来越少。所有这些都使得单核时钟频率变得更高变得更加困难。功率密度增加得如此之快，以至于它们必须以某种方式被控制在限制之内。图 37.2 显示了功率密度的增加。

此外，从应用程序的角度来看，许多新应用程序也是多线程的。单核超标量处理器无法充分利用 TLP。因此，计算机体系结构的总趋势是转向更多的并行性。从复杂的单核架构到更简单的多核架构已经发生了范式转变。这是更粗略的并行性——明确利用 TLP。

多核处理器利用了功率和频率之间的基本关系。通过合并多个内核，每个内核都能够以较低的频率运行，从而在它们之间分配通常分配给单个内核的功率。结果是比单核处理器的性能有了很大的提高。下图 37.3 和 37.4 中的插图显示了这一关键优势。

如图 37.3 所示，将单个内核的时钟频率提高 20% 可实现 13% 的性能提升，但需要增加 73% 的功率。相反，将时钟频率降低 20% 可将功耗降低 49%，但只会导致 13% 的性能损失。

第二个内核被添加到上面的时钟不足示例中。这导致双核处理器在时钟频率降低 20% 的情况下，有效提供 73% 的性能，同时在最大频率下使用与单核处理器大致相同的功率。如图 37.4 所示。

多核处理器是一种特殊的多处理器，所有处理器都在同一芯片上。多核处理器属于 MIMD 类别——不同的内核执行不同的线程（多指令），在内存的不同部分（多数据）上运行。Multicore 是一种共享内存多处理器，因为所有内核共享相同的内存。因此，我们在 MIMD 风格的共享内存架构中讨论的所有问题也适用于这些处理器。

多核架构可以是同构的，也可以是异构的。同类处理器是指在芯片上具有相同内核的处理器。异构处理器在芯片上具有不同类型的内核。软件技术人员需要了解架构才能获得适当的性能提升。随着双核和多核处理器统治整个计算机行业，一些架构问题将与传统问题不同。例如，新的性能准则是每瓦性能。

我们将讨论不同的多核架构作为案例研究。将讨论以下架构：

• 英特尔

• 太阳的尼亚加拉

• IBM 的 Cell BE

其中，该模块将处理英特尔多核架构的显着特征。

Intel’s Multicore Architectures: Like all other partners in the computer industry, Intel also acknowledged that it had hit a ”thermal wall‘’ in 2004 and disbanded one of its most advanced design groups and said it would abandon two advanced chip development projects. Now, Intel is embarked on a course already adopted by some of its major rivals: obtaining more computing power by stamping multiple processors on a single chip rather than straining to increase the speed of a single processor. The evolution of the Intel processors from 1978 to the introduction of multicore processors is shown in Figure 37.5.

The following are the salient features of the Intel’s multicore architectures:

•      Hyper-threading

•      Turbo boost technology

•      Improved cache latency with smart L3 cache

•      New Platform Architecture

• 快速路径互连 (QPI)

• 智能电源技术

• 通过 PCI Express 2.0 和 DDR3 内存接口实现更高的数据吞吐量

• 改进的虚拟化性能

• 使用英特尔主动管理技术远程管理联网系统

• 其他改进 – 增加窗口大小、更好的分支预测、更多指令和加速器等。

我们将详细讨论它们中的每一个。

通过超线程优化多线程性能： Intel 于 2002 年在其处理器上引入了超线程技术。超线程将单个物理处理核心暴露为两个逻辑核心，允许它们在执行线程之间共享资源，从而提高系统效率，如图 37.6 所示。由于缺乏可以明确区分逻辑和物理处理核心的操作系统，英特尔在引入多核 CPU 时删除了此功能。随着 Windows Vista 和 Windows 7 等操作系统的发布，它们充分意识到逻辑内核和物理内核之间的差异，英特尔在酷睿 i7 系列处理器中带回了超线程功能。超线程允许在同一个物理 CPU 内核上同时执行两个执行线程。有四个内核，有八个执行线程。

CPU Performance Boost via Intel Turbo Boost Technology: The applications that are run on multicore processors can be single threaded or multi -threaded. In order to provide a performance boost for lightly threaded applications and to also optimize the processor power consumption, Intel introduced a new feature called Intel Turbo Boost. Intel Turbo Boost features offer processing performance gains for all applications regardless of the number of execution threads created. It automatically allows active processor cores to run faster than the base operating frequency when certain conditions are met. This mode is activated when the OS requests the highest processor performance state. The maximum frequency of the specific processing core on the Core i7 processor is dependent on the number of active cores, and the amount of time the processor spends in the Turbo Boost state depends on the workload and operating environment.

Figure 37.7 illustrates how the operating frequencies of the processing cores in the quad-core Core i7 processor change to offer the best performance for a specific workload type. In an idle state, all four cores operate at their base clock frequency. If an application that creates four discrete execution threads is initiated, then all four processing cores start operating at the quad-core turbo frequency. If the application creates only two execution threads, then two idle cores are put in a low-power state and their power is diverted to the two active cores to allow them to run at an even higher clock frequency. Similar behavior would apply in the case where the applications generate only a single execution thread.

For example, the Intel Core i7-820QM quad-core processor has a base clock frequency of 1.73 GHz. If the application is using only one CPU core, Turbo Boost technology automatically increases the clock frequency of the active CPU core on the Intel Core i7-820QM processor from 1.73 GHz to up to 3.06 GHz and places the other three cores in an idle state, thereby providing optimal performance for all application types. This is shown in the table below.

处理器处于特定 Turbo Boost 状态的持续时间取决于它达到热、功率和电流阈值的时间。有了充足的电源和散热解决方案，酷睿 i7 处理器可以在 Turbo Boost 状态下运行更长的时间。在某些情况下，用户可以通过控制器的 BIOS 手动控制活动处理器内核的数量，以微调 Turbo Boost 功能的运行，以优化特定应用程序类型的性能。

使用智能 L3 缓存改进缓存延迟： 我们都知道缓存是一块高速内存，用于临时数据存储，位于与 CPU 相同的硅片上。如果多核 CPU 中的单个处理内核在执行指令集时需要特定数据，它首先会在其本地缓存（L1 和 L2）中搜索数据。如果数据不可用，也称为缓存未命中，则它会访问更大的 L3 缓存。在独占 L3 缓存中，如果该尝试不成功，则内核执行缓存侦听（搜索其他内核的本地缓存）以检查它们是否具有所需的数据。如果此尝试也导致缓存未命中，则它会访问较慢的系统 RAM 以获取该信息。从缓存读取和写入的延迟远低于从系统 RAM 读取的延迟，

酷睿 i7 系列处理器具有最大可达 12 MB 的包容性共享 L3 缓存。图 37.8 显示了 Core i7-820QM 四核处理器的不同类型的缓存及其布局。它具有四个核心，其中每个核心具有 32 KB 的指令和 32 KB 的 L1 缓存数据，每个核心的 L2 缓存 256 KB，以及 8 兆字节的共享 L3 缓存。L3 缓存在所有内核之间共享，它的包容性通过减少到处理器内核的缓存监听流量来帮助提高性能并减少延迟。包含共享的 L3 缓存可确保如果出现缓存未命中，则数据位于处理器之外，并且在其他内核的本地缓存中不可用，从而消除了不必要的缓存监听。

New Platform Architecture: As shown in Figure 37.9, the previous Intel micro-architectures for a single processor system included three discrete components – a CPU, a Graphics and Memory Controller Hub (GMCH), also known as the Northbridge and an I/O Controller Hub (ICH), also known as the Southbridge. The GMCH and ICH combined together are referred to as the chipset. In the older Penryn architecture, the front-side bus (FSB) was the interface for exchanging data between the CPU and the northbridge. If the CPU had to read or write data into system memory or over the PCI Express bus, then the data had to traverse over the external FSB. In the new Nehalem micro-architecture, Intel moved the memory controller and PCI Express controller from the northbridge onto the CPU die, reducing the number of external data buses that the data had to traverse. These changes help increase data-throughput and reduce the latency for memory and PCI Express data transactions. These improvements make the Core i7 family of processors ideal for test and measurement applications such as high-speed design validation and high-speed data record and playback.

Higher – Performance Multiprocessor Systems with Quick Path Interconnect (QPI): Not only was the memory controller moved to the CPU for Nehalem processors, Intel also introduced a distributed shared memory architecture using Intel QuickPath Interconnect (QPI). QPI is the new point-to-point interconnect for connecting a CPU to either a chipset or another CPU. It provides up to 25.6 GB/s of total bidirectional data throughput per link. Intel’s decision to move the memory controller in the CPU and introduce the new QPI data bus has had an impact for single-processor systems. However, this impact is much more significant for multiprocessor systems. Figure 37.10 illustrates the typical block diagrams of multiprocessor systems based on the previous generation and the Nehalem microarchitecture.

每个 CPU 都可以访问本地内存，但它们也可以通过 QPI 事务访问其他 CPU 的本地内存。例如，一个 Core i7 处理器可以通过 QPI 访问另一个处理器本地的内存区域，或者通过一个直接跃点或通过多个跃点。QPI 是一种高速、点对点互连。图 37.11 说明了这一点。它提供高带宽和低延迟，提供释放新微架构所需的互连性能，并提供企业应用程序中预期的可靠性、可用性和可服务性 (RAS) 功能。RAS 要求通过高级功能得到满足，这些功能包括：CRC 错误检测、用于错误恢复的链路级重试、热插拔支持、时钟故障转移和链路自愈。检测时钟故障 & 自动重新调整宽度 &

它支持在不同组件之间同时移动数据。支持多种功能，例如通道/极性反转、数据恢复和去歪斜电路，以及简化高速链路设计的波形均衡。英特尔® QuickPath Interconnect 包括缓存一致性协议，以在系统运行期间保持分布式内存和缓存结构的一致性。它支持低延迟源监听和可扩展的家庭监听行为。一致性协议提供直接缓存到缓存传输以获得最佳延迟。

The QPI supports a layered architecture as shown in Figure 37.12. The Physical layer consists of the actual wires, as well as circuitry and logic to support ancillary features required in the transmission and receipt of the 1s and 0s – unit of transfer is 20-bits, called a Phit (for Physical unit). • The Link layer is responsible for reliable transmission and flow control – unit of transfer is an 80-bit Flit (for Flow control unit). The Routing layer provides the framework for directing packets through the fabric. The Transport layer is an architecturally defined layer (not implemented in the initial products) providing advanced routing capability for reliable end-to-end transmission. The Protocol layer is the high-level set of rules for exchanging packets of data between devices. A packet is comprised of an integral number of Flits.

智能电源技术：英特尔的多核处理器提供智能电源门，让闲置的处理器内核接近零功耗，从而降低功耗。与之前的 16 至 50 瓦相比，功耗降低至 10 瓦。支持自动低功耗状态，以将处理器和内存置于工作负载允许的最低功耗状态。

通过 PCI Express 2.0 和 DDR3 内存接口实现更高的数据吞吐量： 为了支持现代应用程序以更快的速度移动数据的需求，Core i7 处理器为外部数据总线及其内存通道提供了更高的吞吐量。新处理器采用 PCI Express 2.0 数据总线，将 PCI Express 1.0 的数据吞吐量提高一倍，同时保持与 PCI Express 1.0 的完全硬件和软件兼容性。x16 PCI Express 2.0 链路的最大吞吐量为 8 GB/s/方向。为了将来自 PCI Express 2.0 数据总线的数据存储在系统 RAM 中，Core i7 处理器具有多个 DDR3 1333 MHz 内存通道。具有两个 DDR3 1333 MHz RAM 通道的系统的理论内存带宽为 21.3 GB/s。此吞吐量与 x16 PCI Express 2.0 链路的理论最大吞吐量非常匹配。

改进的虚拟化性能：虚拟化是一种技术，可以在同一处理硬件上并行运行多个操作系统。在测试、测量和控制领域，工程师和科学家已经使用这项技术将离散的计算节点整合到一个单一的系统中。借助 Nehalem 微架构，英特尔在 Core i7 处理器及其芯片组中添加了硬件辅助页表管理和定向 I/O 等新功能，使软件能够进一步提高其在虚拟化环境中的性能。

使用英特尔主动管理技术 (AMT) 远程管理联网系统： AMT 为系统管理员提供远程监控、维护和更新系统的能力。英特尔 AMT 是英特尔管理引擎的一部分，它内置于基于 Nehalem 的系统的芯片组中。此功能允许管理员从远程媒体启动系统，跟踪硬件和软件资产，并执行远程故障排除和恢复。工程师可以使用此功能来管理需要高正常运行时间的已部署自动化测试或控制系统。测试、测量和控制应用程序能够使用 AMT 执行远程数据收集和监控应用程序状态。当应用程序或系统出现故障时，AMT 使用户能够远程诊断问题并访问调试屏幕。这允许更快地解决问题，并且不再需要与实际系统交互。当需要软件更新时，AMT 允许远程完成这些操作，确保系统尽快更新，因为停机时间可能会非常昂贵。AMT 能够为 PXI 系统提供许多远程管理优势。

其他改进：除了上面讨论的功能外，还引入了其他改进。它们列在下面：

• 更大的并行性

 通过增加窗口和调度程序的大小来增加可以乱序执行的代码量
 
• 高效的算法

同步原语的改进性能
更快地处理分支预测错误
改进的硬件预取和更好的 Load-Store 调度
 
• 增强的分支预测

新的二级分支目标缓冲区（BTB）
新重命名的返回堆栈缓冲区 (RSB)
 
• 新的面向应用程序的加速器和英特尔 SSE4（Streaming SIMD extensions-4）

启用 XML 解析/加速器
 
总而言之，我们在本模块中详细介绍了英特尔多核架构的显着特征。讨论了以下功能：

超线程
涡轮增压技术
使用智能 L3 缓存改善缓存延迟
新平台架构
快速路径互连 (QPI)
智能电源技术
通过 PCI Express 2.0 和 DDR3 内存接口实现更高的数据吞吐量
改进的虚拟化性能
使用英特尔主动管理技术远程管理联网系统
其他改进
 
网页链接/支持材料

Computer Architecture – A Quantitative Approach，John L. Hennessy 和 David A. Patterson，第 5 版，Morgan Kaufmann，Elsevier，2011。
英特尔白皮书：
https://www.pogolinux.com/learn/files/quad-core-06.pdf
英特尔技术/研究：http : //www.intel.com/technology/index.htm
节能性能：http : //www.intel.com/technology/eep/index.htm
英特尔酷睿微架构：http : //www.intel.com/technology/architecture/coremicro/
双核处理器：http : //www.intel.com/technology/computing/dual-core/index.htm
多核/多核：http : //www.intel.com/multi-core/index.htm
英特尔平台：http : //www.intel.com/platforms/index.htm
线程：http : //www3.intel.com/cd/ids/developer/asmo-na/eng/dc/threading/index.htm
用于测试、测量和控制的英特尔酷睿 i7 处理器的前八项功能：
http://www.ni.com/white-paper/11266/en/#toc1